TreQ-CG: Clustering Accelerates High-Throughput Sequencing Read Mapping

Authors

Md Pavel Mahmud

Alexander Schliep

Abstract

As high-throughput sequencers become standard equipment outside of sequencing centers, there is an increasing need for efficient methods for pre-processing and primary analysis. While a vast literature proposes methods for HTS data analysis, we argue that significant improvements can still be gained by exploiting expensive pre-processing steps which can be amortized with savings from later stages. We propose a method to accelerate and improve read mapping based on an initial clustering of possibly billions of high-throughput sequencing reads, yielding clusters of high stringency and a high degree of overlap. This clustering improves on the state-of-the-art in running time for small datasets and, for the first time, makes clustering high-coverage human libraries feasible. Given the efficiently computed clusters, only one representative read from each cluster needs to be mapped using a traditional readmapper such as BWA, instead of individually mapping all reads. On human reads, all processing steps, including clustering and mapping, only require 11%–59% of the time for individually mapping all reads, achieving speed-ups for all readmappers, while minimally affecting mapping quality. This accelerates a highly sensitive readmapper such as Stampy to be competitive with a fast readmapper such as BWA on unclustered reads.
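
The cluster-then-map idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `cluster_reads` is a toy prefix-based grouping and not the clustering used by TreQ-CG, `map_read` is a hypothetical exact-match stand-in for a real readmapper such as BWA, and `map_via_representatives` is an invented helper name. The only point it shows is that the mapper is invoked once per cluster rather than once per read.

```python
from collections import defaultdict


def cluster_reads(reads, k=8):
    """Toy clustering (assumption): group reads sharing their first k bases.

    This is NOT the TreQ-CG clustering; it only produces groups of
    overlapping reads so the mapping step can be demonstrated.
    """
    clusters = defaultdict(list)
    for idx, read in enumerate(reads):
        clusters[read[:k]].append(idx)
    return list(clusters.values())


def map_read(read, reference):
    """Hypothetical stand-in for a readmapper such as BWA.

    Performs an exact substring search and returns the 0-based
    position of the first match, or -1 if the read is not found.
    """
    return reference.find(read)


def map_via_representatives(reads, reference, k=8):
    """Map one representative per cluster and reuse its position.

    In this toy setting all members of a cluster start with the same
    k bases, so they inherit the representative's position directly.
    A real pipeline would need per-read offsets within a cluster.
    """
    positions = [None] * len(reads)
    mapper_calls = 0
    for members in cluster_reads(reads, k):
        representative = members[0]
        pos = map_read(reads[representative], reference)
        mapper_calls += 1
        for idx in members:
            positions[idx] = pos
    return positions, mapper_calls


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACCGATGCA"
    reads = [reference[0:12], reference[0:10], reference[5:17], reference[5:15]]
    positions, calls = map_via_representatives(reads, reference)
    print(positions, calls)  # [0, 0, 5, 5] 2
```

With four reads falling into two clusters, the stand-in mapper is called twice instead of four times; the savings reported in the abstract depend on the actual clustering and readmapper, which this sketch does not reproduce.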

Similar resources

Fast and accurate mapping of Complete Genomics reads.

Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to the power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform has different strengths and biases. This diversity both makes it possible to use different technologie...

Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees

MOTIVATION: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (in...

A Non-volatile Near-Memory Read Mapping Accelerator

DNA sequencing is the physical or biochemical process of identifying the location of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, DNA sequencing technology (termed Next Generation Sequencing, NGS) revolutionized genomic research. Modern NGS platforms can sequence hundreds of millions of short DNA fragments in paralle...

Designing Efficient Spaced Seeds for SOLiD Read Mapping

The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the ...

SNP-o-matic

MOTIVATION: High throughput sequencing technologies generate large amounts of short reads. Mapping these to a reference sequence consumes large amounts of processing time and memory, and read mapping errors can lead to noisy or incorrect alignments. SNP-o-matic is a fast, memory-efficient and stringent read mapping tool offering a variety of analytical output functions, with an emphasis on genot...

Journal title:

CoRR

Volume abs/1404.2872

Publication date: 2014
